This analysis examines simulated income data for Hungary, focusing on the relationships between income and various demographic factors including age, location, occupation, and gender. The dataset is simulated to reflect realistic patterns while maintaining a manageable size for analysis.
The data was simulated by the data_simulation.R script.
The data is available in the hungarian_income_data.csv file.
Important things to note about the generated data:
Only 8 of Hungary's larger cities are taken into account, weighted by their population. List of cities: Budapest, Debrecen, Szeged, Miskolc, Pécs, Győr, Szombathely, Eger.
Only the 10 most common occupations are taken into account, weighted by their frequency in the workforce. List of occupations: Software Developer, Teacher, Doctor, Sales Representative, Engineer, Accountant, Nurse, Manager, Chef, Driver.
The age distribution is generated by a beta distribution with parameters \(\alpha = 2\) and \(\beta = 3\) and multiplied by 95 to put the end result in the desired range. The beta distribution with the aforementioned parameters skews the age distribution towards younger ages, which is more realistic.
There are three groups of people categorized by their age:
Underage: each person has a random age at which they start working between 14 and 24.
Working age: 19-67
Pension age: each person has a random retirement age between 60 and 75.
People under 18 have no income.
Working age people have a regular income based on their age, occupation, city, and gender.
Pension age people have a pension based on their occupation and city.
All working age people are considered to be employed.
The income of a working age man is 20,000 HUF higher than the income of a working age woman in the same occupation, city, and age group.
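The age-generation step described in the list above can be sketched as follows. This is a minimal illustration, not the contents of data_simulation.R: the sample size `n` and the seed are assumptions chosen only for the example.

```r
# Sketch only: n and the seed are illustrative assumptions,
# not values taken from data_simulation.R.
set.seed(1)
n <- 10000

# Draw from Beta(2, 3) and stretch the [0, 1] range to [0, 95].
age <- rbeta(n, shape1 = 2, shape2 = 3) * 95

# Beta(2, 3) has mean 2 / (2 + 3) = 0.4, so the scaled mean is close to 0.4 * 95 = 38.
mean(age)

# The distribution is right-skewed (more mass at younger ages),
# so the median lies below the mean.
median(age) < mean(age)
```

Because the shape parameters satisfy \(\alpha < \beta\), most of the probability mass sits in the lower half of the range, which is what pushes the simulated population towards younger ages.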
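library(forcats)
library(dplyr)    # %>%, mutate(), group_by(), summarise()
library(ggplot2)  # ggplot() and the geoms used below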
data <- data %>%
mutate(age_group = cut(age,
breaks = seq(0, 100, by = 2),
right = FALSE,
include.lowest = TRUE,
labels = seq(0, 98, by = 2)))
dem_pyramid <- data %>%
group_by(age_group, gender) %>%
summarise(count = n(), .groups = 'drop') %>%
mutate(count = ifelse(gender == "Male", -count, count))
ggplot(dem_pyramid, aes(x = age_group, y = count, fill = gender)) +
geom_bar(stat = "identity", width = 0.8, color = "black") +
scale_y_continuous(labels = abs, expand = expansion(mult = c(0.05, 0.05))) +
scale_fill_manual(values = c("Male" = "#00BFFF", "Female" = "#FF3B3B")) +
coord_flip() +
labs(title = "Population Pyramid of Simulated Hungarian Data",
x = "Age Group",
y = "Count",
fill = "Gender") +
custom_theme +
theme(legend.position = "top",
axis.text.y = element_text(size = 10, face = "bold"),
plot.margin = margin(t = 20, r = 20, b = 20, l = 20))

We employ two steps to clean the data: interquartile range (IQR) outlier detection and clustering.
We use IQR outlier detection to reduce the noise of the data.
We employ K-Means clustering to find the three major demographic groups: unemployed/young, working age, and retired.
After cleaning the data and finding the three demographic groups, we focus solely on the middle cluster, the working-age people. This study aims to analyze the income of the Hungarian population, and it would be nonsensical to include unemployed people, or retired people as if their pension were a salary.
detect_outliers <- function(x) {
q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
return(x < lower_bound | x > upper_bound)
}
outliers <- detect_outliers(data$income)
data_clean <- data[!outliers, ]
set.seed(42)
income_age_matrix <- data_clean %>%
select(income, age) %>%
scale()
kmeans_result <- kmeans(income_age_matrix, centers = 3, nstart = 25)
data_clean$cluster <- kmeans_result$cluster
cluster_summary <- data_clean %>%
group_by(cluster) %>%
summarise(
mean_income = mean(income),
.groups = 'drop'
) %>%
arrange(mean_income)
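# Note: factor() assigns these labels by cluster id (1, 2, 3), not by mean income,
# so the mapping depends on the cluster numbering obtained with set.seed(42).
cluster_labels <- c("Working Age", "Pension Age", "Unemployed/Young")
data_clean$income_group <- factor(data_clean$cluster,
labels = cluster_labels[order(cluster_summary$mean_income)])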
ggplot(data_clean, aes(x = age, y = income, color = income_group)) +
geom_point(alpha = 0.5) +
scale_color_viridis_d() +
labs(title = "Age vs Income by Cluster",
x = "Age",
y = "Income (HUF)",
color = "Income Group") +
custom_theme

summary_stats <- summary(data)
kable(summary_stats, caption = "Summary Statistics of the Dataset (Outliers Removed)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)

| age | city | occupation | gender | income | starting_age | retirement_age | age_group | cluster | income_group |
|---|---|---|---|---|---|---|---|---|---|
| Min. :14.00 | Length:7375 | Length:7375 | Length:7375 | Min. :379808 | Min. :14.00 | Min. :60.00 | 30 : 397 | Min. :1 | Working Age :7375 |
| 1st Qu.:29.00 | Class :character | Class :character | Class :character | 1st Qu.:529234 | 1st Qu.:18.00 | 1st Qu.:65.00 | 32 : 377 | 1st Qu.:1 | Pension Age : 0 |
| Median :39.00 | Mode :character | Mode :character | Mode :character | Median :579982 | Median :19.00 | Median :67.00 | 34 : 377 | Median :1 | Unemployed/Young: 0 |
| Mean :40.14 | NA | NA | NA | Mean :582750 | Mean :18.89 | Mean :67.07 | 28 : 369 | Mean :1 | NA |
| 3rd Qu.:50.00 | NA | NA | NA | 3rd Qu.:632598 | 3rd Qu.:20.00 | 3rd Qu.:69.00 | 42 : 365 | 3rd Qu.:1 | NA |
| Max. :73.00 | NA | NA | NA | Max. :864318 | Max. :24.00 | Max. :75.00 | 40 : 360 | Max. :1 | NA |
| NA | NA | NA | NA | NA | NA | NA | (Other):5130 | NA | NA |

From the following income distribution plot, we can see that men have, on average, a higher income than women. This does not yet show that a man earns more than a woman in an equal position, but it indicates that this aspect of the data deserves further analysis.
ggplot(data %>% filter(age >= 18), aes(x = income, fill = gender)) +
geom_density(alpha = 0.6) +
scale_fill_viridis_d() +
labs(title = "Income Distribution by Gender",
x = "Income (HUF)",
y = "Density") +
custom_theme

The following plot shows how income is distributed against age. An important thing to note is that as a person ages, their income increases. However, it plateaus after a point and even decreases in certain cases.
ggplot(data, aes(x = age, y = income, color = gender)) +
geom_point(alpha = 0.1) +
scale_color_viridis_d() +
labs(title = "Income Distribution by Age",
x = "Age",
y = "Income (HUF)") +
custom_theme

To get a better grasp of how the income distribution is made up, we can split the data by occupation, which gives a new perspective on how certain occupations are more handsomely rewarded. We can see that Software Developers and Doctors have the highest incomes, whereas Sales Representatives earn a lower income.
ggplot(data %>% filter(age >= 18), aes(x = reorder(occupation, income, FUN = median), y = income, color = occupation)) +
geom_boxplot(alpha = 0.7) +
geom_jitter(alpha = 0.1, width = 0.2) +
scale_color_viridis_d() +
coord_flip() +
labs(title = "Income Distribution by Occupation",
x = "Occupation",
y = "Income (HUF)") +
custom_theme

The following plot, like the one before, splits the data. This time, however, we analyze how the city in which a person works contributes to their salary. It is hard not to notice that the typical person working in the capital, Budapest, enjoys a higher income than those in other cities.
ggplot(data %>% filter(age >= 18), aes(x = reorder(city, income, FUN = median), y = income, fill = city)) +
geom_violin(alpha = 0.7) +
geom_boxplot(width = 0.2, alpha = 0.5) +
scale_fill_viridis_d() +
coord_flip() +
labs(title = "Income Distribution by City",
x = "City",
y = "Income (HUF)") +
custom_theme

income_by_category <- data %>%
filter(age >= 18) %>%
group_by(occupation, city, gender) %>%
summarise(
mean_income = mean(income),
count = n(),
.groups = "drop"
) %>%
arrange(desc(mean_income))
# heatmap
ggplot(income_by_category, aes(x = city, y = occupation, fill = mean_income)) +
geom_tile() +
scale_fill_viridis(name = "Mean Income (HUF)") +
facet_wrap(~gender) +
labs(title = "Mean Income by Occupation, City, and Gender",
x = "City",
y = "Occupation") +
custom_theme +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
strip.text = element_text(face = "bold"))

top_earners <- income_by_category %>%
arrange(desc(mean_income)) %>%
head(10)
kable(top_earners,
caption = "Top 10 Highest Earning Combinations",
digits = 0) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)

| occupation | city | gender | mean_income | count |
|---|---|---|---|---|
| Software Developer | Budapest | Male | 745832 | 99 |
| Software Developer | Budapest | Female | 722333 | 100 |
| Doctor | Budapest | Male | 720956 | 61 |
| Doctor | Budapest | Female | 696091 | 77 |
| Software Developer | Debrecen | Male | 695662 | 43 |
| Software Developer | Szeged | Male | 694999 | 30 |
| Manager | Budapest | Male | 692519 | 107 |
| Software Developer | Szeged | Female | 684845 | 43 |
| Engineer | Budapest | Male | 675140 | 133 |
| Software Developer | Debrecen | Female | 674727 | 40 |
Previously we saw from a plot that a man is, on average, paid more than a woman. To test this result we propose a two-sample t-test (assuming that both samples come from a normal distribution), where our null hypothesis is that the mean male income is equal to the mean female income. Before employing a two-sample t-test, we must first investigate whether the variances of the two samples differ significantly; for this we use an F-test.
##
## F test to compare two variances
##
## data: income by gender
## F = 0.99942, num df = 3636, denom df = 3737, p-value = 0.986
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.9369317 1.0661025
## sample estimates:
## ratio of variances
## 0.999418
Given the above results, we did not find significant evidence that the variances differ, so we do not reject the hypothesis that they are equal. We therefore proceed under the assumption that the distributions from which the two samples are drawn have equal variances.
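The F-test output above has the form printed by R's `var.test`. As a minimal sketch on synthetic data (the group sizes, means, and standard deviation below are made-up assumptions, not values from the study), such a test can be run like this:

```r
# Synthetic stand-in data: sizes, means and spread are made up for illustration.
set.seed(1)
sim <- data.frame(
  gender = rep(c("Female", "Male"), each = 500),
  income = c(rnorm(500, mean = 570000, sd = 70000),
             rnorm(500, mean = 590000, sd = 70000))
)

# H0: the ratio of the two group variances equals 1.
f_test_result <- var.test(income ~ gender, data = sim)
print(f_test_result)
```

A large p-value here only means that the data give no reason to doubt equal variances; it is this outcome that motivates passing `var.equal = TRUE` to `t.test` in the next step.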
# Test if there's a significant difference in income between genders
t_test_result <- t.test(income ~ gender, data = data, alternative="less", var.equal=TRUE)
print(t_test_result)
##
## Two Sample t-test
##
## data: income by gender
## t = -10.655, df = 7373, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Female and group Male is less than 0
## 95 percent confidence interval:
## -Inf -15335.5
## sample estimates:
## mean in group Female mean in group Male
## 573558.0 591693.7
The above results show that the p-value is far below our significance level of \(0.05\), so we reject the null hypothesis: there is statistically significant evidence that the mean income in the female group is lower than the mean income in the male group.
Anyone living in the capital, Budapest, who visits other Hungarian cities may get the impression that there are more young people in Budapest than elsewhere in the country. Many factors could cause this, but one is certainly that Budapest has the most, and the best, universities in Hungary, so many young people from other cities relocate to Budapest for higher education.
We wish to test whether this presumption holds by testing whether the mean age is equal across cities. To do so, we compare the mean age of every city to that of Budapest with a two-sample t-test, assuming that age in every city is normally distributed. Note that R's `t.test` uses Welch's unequal-variance version by default.
# compare each city's mean age with Budapest's mean age
budapest_age <- data$age[data$city == "Budapest"]
other_cities <- unique(data$city[data$city != "Budapest"])
t_test_results <- data.frame(
City = character(),
t_statistic = numeric(),
p_value = numeric(),
mean_diff = numeric(),
stringsAsFactors = FALSE
)
for (city in other_cities) {
city_age <- data$age[data$city == city]
t_test <- t.test(budapest_age, city_age)
t_test_results <- rbind(t_test_results,
data.frame(
City = city,
t_statistic = t_test$statistic,
p_value = t_test$p.value,
mean_diff = mean(budapest_age) - mean(city_age)
))
}
kable(t_test_results,
caption = "T-test Results: Comparing Mean Ages with Budapest",
digits = 4) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)

| | City | t_statistic | p_value | mean_diff |
|---|---|---|---|---|
| t | Miskolc | 0.5700 | 0.5688 | 0.3150 |
| t1 | Pécs | -0.9066 | 0.3648 | -0.5335 |
| t2 | Debrecen | -0.4749 | 0.6349 | -0.2209 |
| t3 | Szeged | -0.9718 | 0.3313 | -0.5160 |
| t4 | Eger | 1.2779 | 0.2017 | 0.7951 |
| t5 | Győr | -0.1126 | 0.9104 | -0.0671 |
| t6 | Szombathely | 0.2632 | 0.7925 | 0.1824 |

The above results show that none of the p-values is anywhere near our significance level of \(0.05\). Thus we fail to reject the null hypothesis in each test: we did not find statistically significant evidence that the mean age of any city differs from that of Budapest.
Moreover, we can visualize the age distribution across cities with a box plot for each city. In the following figure, the age distribution appears to be the same regardless of city.
ggplot(data, aes(x = reorder(city, age, FUN = mean), y = age, fill = city)) +
geom_boxplot(alpha = 0.7) +
scale_fill_viridis_d() +
coord_flip() +
labs(title = "Age Distribution Across Cities",
x = "City",
y = "Age") +
custom_theme

age_by_city <- data %>%
group_by(city) %>%
summarise(
mean_age = mean(age),
sd_age = sd(age),
n = n(),
.groups = "drop"
) %>%
arrange(desc(mean_age))
kable(age_by_city,
caption = "Age Statistics by City",
digits = 1) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)

| city | mean_age | sd_age | n |
|---|---|---|---|
| Pécs | 40.6 | 13.2 | 619 |
| Szeged | 40.6 | 13.4 | 836 |
| Debrecen | 40.3 | 13.2 | 1155 |
| Győr | 40.2 | 13.1 | 592 |
| Budapest | 40.1 | 13.0 | 2574 |
| Szombathely | 39.9 | 12.8 | 397 |
| Miskolc | 39.8 | 13.0 | 703 |
| Eger | 39.3 | 12.7 | 499 |
Let’s say that someone is debating whether to pursue a career as a software developer or as a teacher. Assuming that the pros and cons of both choices cancel out and the only remaining deciding factor is the salary, they might choose the career path that leads to a higher salary on average. Assuming that the two samples are normally distributed, to test whether a software developer has a higher average income than a teacher, we first employ an F-test to check whether the variances of the two distributions are the same, and then use a t-test to check whether the mean of one distribution is statistically significantly higher than the other.
dev_income <- data$income[data$occupation == "Software Developer"]
teacher_income <- data$income[data$occupation == "Teacher"]
##
## F test to compare two variances
##
## data: dev_income and teacher_income
## F = 1.012, num df = 578, denom df = 1098, p-value = 0.8633
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.8789816 1.1689667
## sample estimates:
## ratio of variances
## 1.012039
The p-value (0.8633) is nowhere near our significance level of 0.05, so we did not find significant evidence that the two variances differ, and we cannot reject the null hypothesis that they are equal. Note that this does not mean we found significant evidence that the two variances are the same, only that we did not find strong evidence to the contrary.
Based on this preliminary test, we continue with a two-sample t-test under the added assumption that the two variances are equal.
t_test_result <- t.test(dev_income, teacher_income, alternative = "greater", var.equal=TRUE)
print(t_test_result)
##
## Two Sample t-test
##
## data: dev_income and teacher_income
## t = 38.549, df = 1676, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 107412.9 Inf
## sample estimates:
## mean of x mean of y
## 682597.9 570394.7
The t-test gives a vanishingly small p-value (< 2.2e-16), so we reject the null hypothesis. We therefore state, with statistically significant evidence, that the average income of a software developer is higher than that of a teacher in Hungary.
income_by_occupation <- data %>%
filter(occupation %in% c("Software Developer", "Teacher")) %>%
group_by(occupation) %>%
summarise(
mean_income = mean(income),
sd_income = sd(income),
n = n(),
.groups = "drop"
)
kable(income_by_occupation,
caption = "Income Statistics by Occupation",
digits = 0) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)

| occupation | mean_income | sd_income | n |
|---|---|---|---|
| Software Developer | 682598 | 56903 | 579 |
| Teacher | 570395 | 56564 | 1099 |
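
The t statistic of the equal-variance test can be reproduced from the summary statistics in the table above. The sketch below uses the rounded table values, so it only matches the reported statistic approximately; the variable names are illustrative.

```r
# Summary statistics copied from the table above (rounded to whole HUF).
m_dev <- 682598; s_dev <- 56903; n_dev <- 579
m_tch <- 570395; s_tch <- 56564; n_tch <- 1099

# Degrees of freedom and pooled variance under the equal-variance assumption.
dof <- n_dev + n_tch - 2
sp2 <- ((n_dev - 1) * s_dev^2 + (n_tch - 1) * s_tch^2) / dof

# Two-sample t statistic and one-sided p-value
# (alternative: developers earn more than teachers).
t_stat <- (m_dev - m_tch) / sqrt(sp2 * (1 / n_dev + 1 / n_tch))
p_value <- pt(t_stat, df = dof, lower.tail = FALSE)

c(t = t_stat, df = dof)
```

The result agrees with the reported test (t close to 38.5 on 1676 degrees of freedom), which confirms that the difference of roughly 112,000 HUF is very large relative to its standard error.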
ggplot(data %>% filter(occupation %in% c("Software Developer", "Teacher")),
aes(x = occupation, y = income, fill = occupation)) +
geom_boxplot(alpha = 0.7) +
scale_fill_viridis_d() +
labs(title = "Income Distribution: Software Developers vs Teachers",
x = "Occupation",
y = "Income (HUF)") +
custom_theme

To investigate whether there is a significant difference in income between cities, we employ a chi-squared test. We first create income categories based on the quartiles of the income distribution: Low (0-25%), Medium-Low (25-50%), Medium-High (50-75%), High (75-100%). Then we apply the chi-squared test to the data. The test shows whether the distribution of income categories is independent of city or whether there is a significant association. The visualizations help us understand the nature of these relationships by showing the proportion of each income category within each city.
data <- data %>%
mutate(income_category = cut(
income,
breaks = quantile(income, probs = seq(0, 1, 0.25)),
labels = c("Low", "Medium-Low", "Medium-High", "High"),
include.lowest = TRUE)
)
city_chi <- chisq.test(table(data$city, data$income_category))
print(city_chi)
##
## Pearson's Chi-squared test
##
## data: table(data$city, data$income_category)
## X-squared = 2624.1, df = 21, p-value < 2.2e-16
ggplot(data, aes(x = city, fill = income_category)) +
geom_bar(position = "fill") +
scale_fill_viridis_d() +
labs(title = "Income Distribution by City",
x = "City",
y = "Proportion",
fill = "Income Category") +
custom_theme +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

The chi-squared test shows a statistically significant association between city and income distribution (p-value < 2.2e-16). With such a small p-value we reject the null hypothesis of independence: the income category of a person is associated with the city where they work. Looking at the visualization, we can observe that Budapest has a notably higher proportion of high-income earners than the other cities, while in cities such as Eger and Miskolc the majority of the population consists of low-income earners.
To further explore the components that make up the income distribution, we employ a chi-squared test to test if there is significant association between the occupation and the income. Our null hypothesis is that the distribution of income categories is independent of occupation.
##
## Pearson's Chi-squared test
##
## data: table(data$occupation, data$income_category)
## X-squared = 2736.5, df = 27, p-value < 2.2e-16
ggplot(data, aes(x = occupation, fill = income_category)) +
geom_bar(position = "fill") +
scale_fill_viridis_d() +
labs(title = "Income Distribution by Occupation",
x = "Occupation",
y = "Proportion",
fill = "Income Category") +
custom_theme +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

The chi-squared test shows a statistically significant association between occupation and income distribution (p-value < 2.2e-16). With such a small p-value we reject the null hypothesis of independence: a person's income category is associated with their occupation.
Looking at the visualization, we can observe that:
1. Software Developers and Doctors have a much higher proportion of high-income earners than other occupations.
2. Sales Representatives and Drivers tend to have a higher proportion of lower-income categories.
3. The income distribution varies considerably across occupations, highlighting the impact of career choice on earning potential.
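
The occupation test follows the same recipe as the city test: bin income into quartile categories, cross-tabulate, and run `chisq.test`. A minimal sketch on synthetic data is shown below; the three occupations, the sample size, and the `shift` effects are made-up assumptions for illustration only.

```r
# Synthetic stand-in data: occupations, sample size and effects are made up.
set.seed(1)
n <- 2000
sim <- data.frame(
  occupation = sample(c("Teacher", "Nurse", "Engineer"), n, replace = TRUE)
)

# Hypothetical occupation effects so that income depends on occupation.
shift <- c(Teacher = 0, Nurse = -20000, Engineer = 40000)
sim$income <- rnorm(n, mean = 580000 + shift[as.character(sim$occupation)], sd = 50000)

# Quartile-based income categories, as in the analysis above.
sim$income_category <- cut(
  sim$income,
  breaks = quantile(sim$income, probs = seq(0, 1, 0.25)),
  labels = c("Low", "Medium-Low", "Medium-High", "High"),
  include.lowest = TRUE
)

# H0: income category is independent of occupation.
occupation_chi <- chisq.test(table(sim$occupation, sim$income_category))
print(occupation_chi)
```

The degrees of freedom equal (rows - 1) × (columns - 1). With 3 occupations and 4 categories that gives 6 in the sketch, and with the 10 occupations of the study it gives the 27 reported above.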
From the following figure we can see how an individual's age contributes to their income, for men and women. It looks as though the older someone gets, the higher their income is on average. However, after a certain age it plateaus and even decreases for the oldest people of working age.
ggplot(data, aes(x = age, y = income, color = gender)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "loess", se = TRUE) +
scale_color_viridis_d() +
labs(title = "Relationship between Age and Income",
x = "Age",
y = "Income (HUF)") +
custom_theme

The previous smoothed fit considered only age as a factor; the following model considers all variables. Before constructing the model, we need to convert categorical data such as city, occupation, and gender to factor variables. After converting the data, we can fit a linear model that considers all variables.
# convert categorical variables to factors
data$city <- as.factor(data$city)
data$occupation <- as.factor(data$occupation)
data$gender <- as.factor(data$gender)
# fit model
model1 <- lm(income ~ age + I(age^2) + city + occupation + gender, data = data)
model_summary <- summary(model1)
kable(tidy(model_summary), caption = "Multiple Linear Regression Results") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 433660.30592 | 2924.514442 | 148.28455 | 0 |
| age | 8543.37666 | 141.091922 | 60.55185 | 0 |
| I(age^2) | -79.78386 | 1.679722 | -47.49824 | 0 |
| cityDebrecen | -50120.85179 | 881.049977 | -56.88764 | 0 |
| cityEger | -100597.32872 | 1216.712130 | -82.67965 | 0 |
| cityGyőr | -98769.04300 | 1133.472750 | -87.13844 | 0 |
| cityMiskolc | -99836.93439 | 1058.419964 | -94.32639 | 0 |
| cityPécs | -98771.68876 | 1114.154184 | -88.65172 | 0 |
| citySzeged | -49273.11217 | 990.151081 | -49.76323 | 0 |
| citySzombathely | -99950.31290 | 1341.130260 | -74.52692 | 0 |
| occupationChef | -42337.58069 | 1523.557071 | -27.78864 | 0 |
| occupationDoctor | 59638.84216 | 1541.323963 | 38.69326 | 0 |
| occupationDriver | -50997.27172 | 1527.740727 | -33.38084 | 0 |
| occupationEngineer | 17923.05321 | 1244.335517 | 14.40371 | 0 |
| occupationManager | 37864.23296 | 1344.885670 | 28.15424 | 0 |
| occupationNurse | -32122.23466 | 1182.070408 | -27.17455 | 0 |
| occupationSales Representative | -60634.90344 | 1052.592704 | -57.60529 | 0 |
| occupationSoftware Developer | 88290.91287 | 1329.193817 | 66.42441 | 0 |
| occupationTeacher | -22562.66376 | 1123.416098 | -20.08398 | 0 |
| genderMale | 18906.14278 | 579.494893 | 32.62521 | 0 |
From the above results we can see that, according to the fitted linear model, every term is statistically significant for predicting the income of an individual. The p-values are shown as 0 because they are smaller than the displayed precision.
# polynomial regression
model2 <- lm(income ~ poly(age, 3), data = data)
# static plot
ggplot(data, aes(x = age, y = income)) +
geom_point(alpha = 0.1, color = income_palette[1]) +
geom_smooth(method = "lm", formula = y ~ poly(x, 3),
color = income_palette[5], fill = income_palette[5], alpha = 0.2) +
labs(title = "Polynomial Regression: Age vs Income",
subtitle = "Cubic polynomial fit with confidence interval",
x = "Age",
y = "Income (HUF)") +
custom_theme

# model comparison
model_comparison <- data.frame(
Model = c("Multiple Linear", "Polynomial"),
R_squared = c(summary(model1)$r.squared, summary(model2)$r.squared),
Adj_R_squared = c(summary(model1)$adj.r.squared, summary(model2)$adj.r.squared)
)
kable(model_comparison, caption = "Model Comparison") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)

| Model | R_squared | Adj_R_squared |
|---|---|---|
| Multiple Linear | 0.8863107 | 0.8860170 |
| Polynomial | 0.1610902 | 0.1607487 |
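
The adjusted values in the table can be checked against the usual formula \(\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}\). The sketch below assumes \(n = 7375\) observations (the count shown in the summary statistics) and takes \(p\) to be the number of non-intercept coefficients: 19 for the multiple linear model and 3 for the cubic polynomial. The helper `adj_r2` is introduced here only for illustration.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
# (p excludes the intercept).
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 7375  # assumption: number of rows reported in the summary statistics

adj_multi <- adj_r2(0.8863107, n, p = 19)  # multiple linear model
adj_poly  <- adj_r2(0.1610902, n, p = 3)   # cubic polynomial in age

c(multiple_linear = adj_multi, polynomial = adj_poly)
```

With more than seven thousand observations the penalty for extra predictors is tiny, which is why the adjusted and unadjusted values barely differ for either model.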
What is the predicted income of a 35-year-old male software developer working in Budapest according to the first model?
new_data <- data.frame(
age = 35,
city = "Budapest",
occupation = "Software Developer",
gender = "Male"
)
# Predict income
prediction <- predict(model1, newdata = new_data, interval = "prediction")
cat("Predicted income of a", new_data$age, "years old", new_data$gender, "living in", new_data$city, "working as a", new_data$occupation, "is", prediction[1])
## Predicted income of a 35 years old Male living in Budapest working as a Software Developer is 742140.3
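
This prediction can also be reproduced by hand from the coefficient table of the multiple linear regression. The sketch assumes that Budapest is the reference level of `city` (it has no row of its own in the table), so it contributes nothing to the sum; the names used in `coefs` are illustrative.

```r
# Coefficients copied from the regression table; Budapest is assumed to be
# the reference city, so no city term is added.
coefs <- c(
  intercept = 433660.30592,
  age = 8543.37666,
  age_sq = -79.78386,
  software_developer = 88290.91287,
  male = 18906.14278
)

age <- 35
manual_prediction <- coefs[["intercept"]] +
  coefs[["age"]] * age +
  coefs[["age_sq"]] * age^2 +
  coefs[["software_developer"]] +
  coefs[["male"]]

manual_prediction
```

The hand computation lands on the same value as `predict`, about 742,140 HUF, which makes the contribution of each term visible: the occupation adds roughly 88,000 HUF and gender roughly 19,000 HUF on top of the age curve.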
What is the predicted income of a 35-year-old male software developer working in Budapest according to the second model?
new_data <- data.frame(
age = 35,
city = "Budapest",
occupation = "Software Developer",
gender = "Male"
)
# Predict income
prediction <- predict(model2, newdata = new_data, interval = "prediction")
cat("Predicted income of a", new_data$age, "years old", new_data$gender, "living in", new_data$city, "working as a", new_data$occupation, "is", prediction[1])
## Predicted income of a 35 years old Male living in Budapest working as a Software Developer is 584359.7
We conclude this study with the following takeaways:
The population pyramid gave us a visual representation of the age distribution in the simulated Hungarian data. From the diagram we conclude that the fertility rate is lower than expected, as there are very few young children.
From the age-income plot we saw that there are three groups of people: young/unemployed, working age, retired elderly.
Based on the outcome of the age-income plot we employed a clustering algorithm (K-means) to split the data according to the aforementioned demographic groups.
We employed some other commonly used descriptive statistics to visualize the data that we are working with, to inform the questions that might be interesting to investigate.
We found significant evidence that the average income of a man working in Hungary is higher than the average income of a woman working in Hungary.
We could not find evidence that the mean age in other Hungarian cities differs from the mean age in Budapest. We therefore treat the average age as about the same across the cities studied.
We found significant evidence that the average salary of a software developer, regardless of age, gender, and city, is higher than that of a teacher.
We found significant evidence that the income of an individual working in Hungary is not independent from the city that they work in.
We found significant evidence that the income of an individual working in Hungary is not independent from the occupation that they are employed in.
We fitted two models for predicting the income based on the age of an individual. One that has different predictions based on gender, and one that is gender-neutral.
We fitted a linear model that takes into account all available variables to predict the income of an individual. From the fitted model we concluded that all variables play a significant role in predicting the income of the person.